Education policy research in the era of big data: Methodological frontiers,
misconceptions, and challenges

Abstract: Despite the abundance of data and the increased data availability brought by
technological advances, very few education policy studies have used big data, which is
characterized by large volume, wide variety, and high velocity. Drawing on recent progress in
the use of big data in public policy research and computational social science, this paper aims
to demonstrate the potential of big data and how big data can be used in education policy
research. First, I introduce big data that is potentially relevant to education policy research.
I then present methodological frontiers, examining the assumptions, key concepts, merits, and
caveats of three analytical approaches commonly used in mining massive amounts of text data:
topic models, network text analysis, and sentiment analysis. Next, to ensure the veracity of
using big data in education policy research, I debunk three methodological misconceptions. This
article concludes with a discussion on developing interdisciplinary research capacity and
addressing the privacy concerns and ethical conundrums as we explore a research agenda of using
big data in education policy.
Keywords: big data; education policy; network text analysis; sentiment analysis; text mining;
topic models

Education policy research in the era of big data: Methodological frontiers,
misconceptions, and challenges

Abstract: Despite the abundance of data and the increased availability of data brought by
technological advances, there have been very few education policy studies that used big data,
characterized by large volume, wide variety, and high velocity. Based on recent progress in the
use of big data in public policy research and computational social science, this paper aims to
demonstrate the potential of big data and how big data can be used in education policy research.
First, I introduce big data that is potentially relevant to education policy research. I then
present methodological frontiers, examining the assumptions, key concepts, merits, and caveats
of three analytical approaches commonly used in mining massive amounts of text data: topic
models, network text analysis, and sentiment analysis. Next, to ensure the veracity of using big
data in education policy research, I debunk three methodological misconceptions. This article
concludes with a discussion on developing interdisciplinary research capacity and addressing the
privacy concerns and ethical conundrums as we explore a research agenda of using big data in
education policy.

Keywords: big data; education policy; network text analysis; sentiment analysis; text mining;
topic models


Introduction 

The rise of big data, characterized by large volume, wide variety, and high velocity (boyd & 
Crawford, 2012), has breathed new life into the social sciences (King, 2011). Progress in 
acquiring, processing, and analyzing big data has informed many fields in the social sciences, such 
as political science (Bode, Hanna, Yang, & Shah, 2015; Schwartz & Ungar, 2015), public health 
(Achrekar, Gandhe, Lazarus, Yu, & Liu, 2011; Bates, Saria, Ohno-Machado, Shah, & Escobar, 
2014), economics (Einav & Levin, 2014), and criminology (Chen, Cho, & Jang, 2015), to name a 
few. In the field of education, despite the fast-growing body of literature on learning analytics, 
which collects and analyzes big data to optimize student learning, particularly in online learning 
environments (Baker & Yacef, 2009), there has been limited scholarship on capitalizing on big data in 
education policy. 

To discuss the potential of big data and how big data can be used in education policy 
research, in this commentary I first introduce big data that is potentially relevant to education policy 
research. Given the conspicuous absence of education policy studies using big data, I primarily draw 
upon the literature on big data in public policy and computational social science, the emerging field 
at the convergence of computer science and the social sciences that uses computational modeling to 
analyze massive amounts of digital data harvested mostly from digital media sources to study social 
phenomena (Lazer et al., 2009; Shah, Cappella, & Neuman, 2015; Watts, 2013). Grounded in the 
broad literature relevant to education policy research, I then introduce three methodological 
frontiers of mining massive amounts of text data (i.e., a corpus of texts; Fleuren & Alkema, 2015; 
Hearst, 1999): topic models, network text analysis, and sentiment analysis. In particular, I examine 
the assumptions, key concepts, merits, and caveats of each of the three analytical approaches. Next, 
to ensure the veracity of using big data in education policy research, I debunk three methodological 
misconceptions. This paper concludes with a discussion on developing interdisciplinary research 
capacity and addressing the privacy concerns and ethical conundrums as we explore a research 
agenda of using big data in education policy. 

Big Data in Education Policy Research 

What is big data? The term is loosely defined, and there is no fixed cutoff point for 
how big a dataset must be in order to be considered big data. Generally speaking, big 
data has three distinct features: large volume, wide variety, and high velocity (boyd & Crawford, 
2012). In addition, some posit a fourth feature of veracity—the trustworthiness and accuracy of big 
data (Bello-Orgaz, Jung, & Camacho, 2015). In this commentary, I draw upon the literature of big 
data in public policy research and computational social science to delineate each of the first three 
features of big data, followed by the potential value of big data in education policy research. I then 
address the fourth feature of veracity as I present the methodological frontiers and debunk potential 
misconceptions. Here I proceed to introduce the first three features of big data: large volume, wide 
variety, and high velocity. 

Large Volume 

The first defining feature of big data is its massive volume. While the name “big data” 
suggests large volume, there is no fixed cutoff point distinguishing “big” from “small” data. 
Oftentimes, the volume of big data is staggeringly large, such as over 118,000 speeches in the U.S. 
Senate (Quinn, Monroe, Colaresi, Crespin, & Radev, 2010), over 16,000 articles in the journal Science 
from 1990 to 1999 (Blei & Lafferty, 2007), 13,246 political blog posts on the 2008 presidential 
election (Roberts, Stewart, & Tingley, 2016), and over 16,000 documents on climate change 
(Boussalis & Coan, 2016). 

In education policy research, the large volume of data can come from a host of sources. 
First, data-driven education accountability systems provide an unprecedentedly large volume of data 
on students, teachers, administrators, schools, and communities. For a ready example, a recent study 
tapped into the record of about 200 million test scores in math and reading to examine educational 
inequalities (Reardon, Kalogrides, & Shores, 2016). Second, digital footprints, the digital trace data we 
create as we use digital services and devices (Howison, Wiggins, & Crowston, 2011), are another 
source of big data that is valuable to education policy research. The data generated from our online 
activities, either posting tweets on Twitter (e.g., Barbera, 2015) or reading the Facebook newsfeed 
(e.g., Kramer, Guillory, & Hancock, 2014), could serve as the proxy for online public opinion on 
education policy. For instance, the hashtags #CommonCore and #CCSS are among the most 
frequently used hashtags by the digital public when discussing the education policy on the Common 
Core State Standards on social media (Supovitz & Reinkordt, 2017; Wang & Fikis, 2016). One of the 
frequently co-occurring hashtags with #CommonCore and #CCSS is #OptOut, which refers to the 
movement of opting out of high-stakes standardized testing (Wang & Fikis, 2017). At the epicenter 
of the opt-out movement in the state of New York, the Long Island Opt-out group—a major 
advocate for opting out of state standardized testing—has tens of thousands of Facebook group 
members discussing testing and sharing resources on how to opt out (Wang, 2017). Third, 
technological advances in data acquisition increase data availability for education policy research. For 
instance, optical character recognition (OCR) was used to convert undigitized text into 
machine-readable data in a study examining the latent topics in 1,539 articles 
published in Educational Administration Quarterly from 1965 to 2014 (Wang, Bowers, & Fikis, 2017). 
Another common data acquisition approach is to use application programming interfaces (APIs) to 
retrieve data generated on the Internet. Many technology companies (e.g., Google, Facebook, and 
Twitter) use APIs to grant others limited access to their data so that more applications can be 
created using their data. The Twitter API is one of the most popular APIs among researchers. For 
instance, the Twitter API was used to collect 660,051 tweets containing the hashtags #CommonCore 
and #CCSS to study online public opinion on the Common Core State Standards (Wang & Fikis, 
2017). Fourth, collaborations with technology companies allow academics to access big data. An 
example is a collaborative research undertaking that carried out a randomized controlled experiment 
involving 61 million Facebook users to study social influence within online social networks (Bond et 
al., 2012). All these sources (the data-driven education accountability systems, the digital footprints 
generated by digital services and devices, data acquisition technologies, and collaborations with 
technology companies) provide unprecedentedly voluminous amounts of data for education 
policy research. 

Wide Variety 

Big data does not simply mean large volume. It also refers to the wide variety of types of 
data, including, but not limited to, videos, images, audio, media coverage, blogs, social media posts, 
online comments, records of government agencies, mobile phone logs, data generated by wearable 
digital devices, and satellite images. These diverse data sources, along with the conventional data 
sources in education policy research (e.g., interviews, observations, artifacts, surveys, and archived 
documents), provide researchers with rich, multi-dimensional insights into the education policy 
issues of interest. 

One potential for using big data in education policy research is to investigate public opinion 
on education policy. Since public opinion is one of the factors shaping policymaking (Burstein, 2003; 
Gormley Jr., 2016; Page & Shapiro, 1983), researchers have studied public opinion on a variety of 
education issues, including the policy in early childhood education (Gormley Jr., 2016), school 
reform (Henderson, Peterson, & West, 2016), school quality (Jacobsen, Snyder, & Saultz, 2014), and 
race-based and wealth-based student achievement gaps (Valant & Newark, 2016). Prior studies on 
public opinion on education and policy used the data collected from surveys, including the EdNext 
annual survey of American public opinion on education (Peterson, Henderson, West, & Barrows, 
2017), the survey administered to the members of YouGov, a company conducting academic 
survey research and online political polling (Valant & Newark, 2016), and the survey fielded by 
Knowledge Networks to a sample representative of the U.S. population to study the effect of school 
report card formats on public opinion on school quality (Jacobsen, Snyder, & Saultz, 2014). In the 
big data era, in addition to survey data, a growing number of studies have examined public opinion 
on policy issues using social media data—the data generated by the digital public as they discuss 
policy issues on social media. A few examples could suffice. Over 120,000 tweets were collected to 
examine public opinion on health reform (King et al., 2013). Chung and Zeng (2015) developed a 
system called ‘iMood’ to identify opinion leaders, influential users, and community activists on the 
U.S. immigration policy by analyzing about one million tweets. Whitman (2015) detailed how 
data from Twitter and Google Trends were used to measure public opinion on U.S. space 
policy. Reddick, Chatfield, and Jaramillo (2015) examined public opinion on the National Security 
Agency’s mass surveillance of Americans through a critical discourse analysis of tweets 
along with the survey data collected from a randomly sampled public opinion poll of Americans. 

In addition to social media data, researchers have been exploring how to use other sources 
of big data, such as mobile phone metadata and satellite images. Toole and his team demonstrated 
the potential of using mobile phone metadata to improve the forecasts of critical economic 
indicators for governments’ policymaking (Toole et al., 2015). Toole et al. used the mobile phone 
metadata (e.g., who called whom, the location of the cell towers through which the calls were made, 
the time of the calls, the total number of calls, and the number of incoming and outgoing calls) as a 
proxy to detect the changes in mobility and social interactions caused by layoffs, and then predicted 
the aggregated unemployment rates months before the official reports were released. Moreover, to 
leverage the value of image data, a novel approach is to use satellite-recorded nighttime lights to 
estimate poverty and economic growth (Henderson, Storeygard, & Weil, 2012; Pinkovskiy & 
Sala-i-Martin, 2015; Xie, Jean, Burke, Lobell, & Ermon, 2016). Suffice it to say, these examples 
present the tantalizing potential of using big data in education policy research. 

High Velocity 

The third feature of big data is high velocity: the high speed of data generation. Unlike 
survey data, which are generated periodically whenever a survey is administered, much of the big 
data (such as web browsing records, Facebook status updates, YouTube videos, and phone call logs) are 
generated in a constant stream. High-velocity streaming data pose both methodological challenges 
and opportunities, particularly in processing the constantly incoming data in a timely 
fashion so that the data can be analyzed in real time or near real time. For instance, a group of 
researchers capitalized on the brevity (i.e., no more than 140 characters) and immediacy 
of tweets to detect traffic events in urban areas (Gutierrez, Figuerias, Oliveira, Costa, & 
Jardim-Goncalves, 2015). In education policy research, a common limitation is the time lag of months, if 
not years, between data collection and the reporting of results. In the big data era, one solution to the time lag 
limitation is to capitalize on diverse data sources and automate or semi-automate data processing, 
analysis, and visualization, so that raw data can be processed, analyzed, reported, and visualized in a 
timely manner to inform education policymaking, implementation, and evaluation. Methodological 
advances in real-time analytics have been growing fast, and the three methodological 
frontiers introduced in the following section can all be automated or semi-automated. Real-time 
analytics is an area that those who are interested in using big data in education policy 
research should watch closely. 

Methodological Frontiers 

In the big data era, the wealth of data available to social scientists has been likened to what the 
microscope is to microbiologists (King, 2011). It is certainly appealing that big data could enrich our 
understanding of social phenomena and relevant education policy issues. Yet big data, as King 
(2016) noted, “is not about the data” (p. iii). Rather, big data is about the analytics: the 
methodological approaches that extract insights from large-volume, wide-variety, and high-velocity 
data. To surmount the methodological challenges brought by big data, emerging analytical tools have 
been rapidly developed at the intersection of the social sciences and computer science to analyze 
massive amounts of digital data (Alvarez, 2016; Lazer et al., 2009; National Research Council, 2013; 
Shah, Cappella, & Neuman, 2015; Watts, 2013). Given the conspicuous absence of education policy 
studies using big data, I primarily draw upon the relevant literature on big data in public policy and 
computational social science to introduce three methodological frontiers—topic models, network 
text analysis, and sentiment analysis—along with their assumptions, key concepts, merits, and 
caveats. It is important to stress at the outset that the methodological approaches introduced here 
are only the commonly used ones (Alvarez, 2016; Feldman, 2013; Verd & Lozares, 2014). They do 
not represent a comprehensive survey of all the methodological tools to analyze big data. Neither do 
they supplant conventional methodological approaches in education policy research. By introducing 
the methodological tools, this article aims to invite education policy researchers to venture into big 
data by applying the tools to education policy research. 

Topic Models 

Topic modeling (Blei, 2012; Blei, Ng, & Jordan, 2003) is one of the prominent, rapidly 
developing methods to infer latent topics in massive amounts of text data. Topic models assume 
that each document contains multiple topics, and that each topic is represented by a probability 
distribution over a set of words. For example, the topic of social justice is described by words 
such as “inequity,” “justice,” “race,” “disability,” and “bilingual” (Wang et al., 2017). In addition to 
identifying latent topics in text data, topic models have been developed to uncover how the topics 
are related to one another—such as the correlated topic models (Blei & Lafferty, 2007), and how the 
topics evolve over time—such as the dynamic topic models (Blei & Lafferty, 2006). 

The popularity of topic models is partly explained by the fact that they are effective and 
scalable to explore the latent topics in massive amounts of text data. Topic models can be either 
fully automated (i.e., unsupervised) or semi-automated (i.e., supervised) (Roberts et al., 2016). When 
unsupervised topic models are applied, no prior annotation or labeling of the documents is 
needed, as the topics are identified by the high-probability words that describe a particular topic. 
Therefore, topic models are highly scalable, without the constraint of human cognition entailed in 
manual coding. Thanks to this scalability, topic models have been applied to large 
text datasets from a variety of data sources: 24,236 press releases from the U.S. Senate (Grimmer, 
2010), the opinion texts from the U.S. Supreme Court’s 4,321 non-unanimous decisions from 
1949 to 2006 (Lauderdale & Clark, 2014), over 21,000 scholarly articles on literary studies over the 
last 120 years (Goldstone & Underwood, 2014), and 233,452 online healthcare chat logs (Wang, 
Huang, & Gan, 2016). Recently, topic models have been used for automated annotation of images 
(Feng & Lapata, 2010) and videos (Katsurai, Ogawa, & Haseyama, 2012). 

In education policy research, topic models can be applied to infer latent topics in large 
corpora compiled from multiple data sources. They include records of government agencies, news 
coverage, and social media discourse throughout the policy processes from problem identification 
and framing to ongoing evaluations of existing policies. Because I found no study that has applied 
topic models in education policy research, here I present a topic modeling study in education 
research. Wang and her team used topic modeling and identified 120 latent topics in 16,524 
documents published in the 116-year history of the longest-running journal in education, Teachers 
College Record (TCR), from 1900 to 2015 (Wang, Bowers, Chae, Fikis, & Natriello, 2017). The 120 
topics were identified by generating two matrices: (1) a matrix of word probabilities by latent topics, 
and (2) a matrix of topic proportions by TCR articles (see Blei, 2012, for a thorough explication of 
probabilistic topic modeling). Among the 120 topics in education literature over the last 116 years, 
the topic of education in wartime disappeared after the end of the Cold War in the 1980s; the topic of 
social justice has been on the rise since the 1950s; the topic of measurement of student achievement 
has garnered persistent attention in the education research community since the 1900s. 
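
The two matrices can be made concrete with a toy example. The sketch below is illustrative only: the three-word vocabulary, the two topic names, the article identifiers, and every probability are made-up assumptions, not values from the TCR study. It shows how, under the mixture assumption of probabilistic topic models, a document's word distribution is obtained by weighting each topic's word probabilities by that document's topic proportions, and how the highest-probability words of a topic can be read off to help label it.

```python
# Toy illustration of the two matrices produced by a topic model.
# All names and numbers are hypothetical, chosen only for readability.

vocabulary = ["justice", "test", "score"]

# Matrix 1: word probabilities by latent topic (each row sums to 1).
topic_word = {
    "social_justice": [0.8, 0.1, 0.1],
    "measurement": [0.1, 0.4, 0.5],
}

# Matrix 2: topic proportions by document (each row sums to 1).
doc_topic = {
    "article_A": {"social_justice": 0.9, "measurement": 0.1},
    "article_B": {"social_justice": 0.2, "measurement": 0.8},
}


def word_distribution(doc_id):
    """Mix the topic-word rows using the document's topic proportions."""
    mixed = [0.0] * len(vocabulary)
    for topic, proportion in doc_topic[doc_id].items():
        for i, p in enumerate(topic_word[topic]):
            mixed[i] += proportion * p
    return dict(zip(vocabulary, mixed))


def top_words(topic, n=2):
    """Return the n highest-probability words, which help label a topic."""
    ranked = sorted(zip(vocabulary, topic_word[topic]), key=lambda pair: -pair[1])
    return [word for word, _ in ranked[:n]]
```

In a real analysis both matrices are estimated from the corpus rather than written by hand; the sketch only shows how they are read once estimated.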

Researchers who apply topic models to education policy research need to be aware of a 
couple of caveats. First, topic models, albeit valuable, are only a blunt tool to explore large corpora. 
Unlike the fine-grained results generated by manual coding, the results of topic models are a 
coarse-grained description of the text datasets. Further, topic models, even unsupervised topic models, 
necessitate researchers’ subjective decisions on modeling and choosing the best-fitting model. Most 
topic models do not run on every single word in the datasets. Rather, before running topic models, 
the text data need to be processed and cleaned. In this data cleaning process, a common practice is 
to remove the words that do not convey topical meaning or the words that are used most and least 
frequently. To do so, scholars may take different approaches, as “there is no globally best method” 
(Grimmer & Stewart, 2013, p. 3). A few examples suffice. Blei and Lafferty (2007) removed 
words by two criteria: words that occurred fewer than 70 times, and 296 stopwords such as 
“a,” “an,” “the,” and “around.” Roberts et al. (2016) removed the words that occurred in fewer than 
1% of the 13,246 blog posts. Grimmer (2010) removed the words that occurred in fewer than 0.5% 
or more than 90% of the documents, as well as the stopwords. Grün and Hornik (2011) calculated the 
mean term frequency-inverse document frequency (tf-idf), and then only included the words that 
have a tf-idf value slightly above the median to remove the very frequently used words. All these 
examples demonstrate how running topic models entails researchers’ modeling decisions. 
Moreover, researchers interpret the topic model output and oftentimes label the topics by 
drawing upon their knowledge of the context of the text data. For instance, to label each 
topic, the researchers examined the topics that have been identified periodically in the existing 
literature in the field, and took into account the results generated from the topic models: the 
high-probability terms and the high-probability articles (Wang et al., 2017). This process of data 
interpretation relies on the researchers’ contextual knowledge of the data. For this reason, topic 
models, while they do not replace humans, “amplify human abilities” (Grimmer & Stewart, 2013, p. 4). 
To that end, the most promising way of automated text mining is not to negate the researchers’ need 
to read the texts, but rather “to identify the best way to use both humans and automated methods 
for analyzing texts” (Grimmer & Stewart, 2013, p. 4). 
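
The vocabulary-pruning decisions described above can be sketched in a few lines. This is a minimal illustration rather than any cited author's actual pipeline: the function name `prune_vocabulary`, the whitespace tokenizer, and the four-word stopword set are assumptions made for the example. The default thresholds simply mirror the 0.5% and 90% document shares reported above for Grimmer (2010), and a researcher would choose different values for a different corpus.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption), not a full stopword list.
STOPWORDS = {"a", "an", "the", "around"}


def prune_vocabulary(documents, min_share=0.005, max_share=0.90, stopwords=STOPWORDS):
    """Drop stopwords and words that appear in too few or too many documents.

    A word is kept when the share of documents containing it lies within
    [min_share, max_share]. Returns the documents as lists of kept tokens.
    """
    # Naive lowercase whitespace tokenization, used here only for brevity.
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: count each word at most once per document.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    kept = {
        word
        for word, count in doc_freq.items()
        if word not in stopwords and min_share <= count / n_docs <= max_share
    }
    return [[token for token in tokens if token in kept] for tokens in tokenized]
```

Changing `min_share`, `max_share`, or the stopword set changes which words reach the topic model, which is one concrete way the researcher's modeling decisions shape the resulting topics.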

Network Text Analysis 


Network text analysis is another emerging methodological frontier to analyze text data. A 
network is formed by nodes and ties (Borgatti, Everett, & Johnson, 2013). To conceptualize texts as 
networks, the units of texts (i.e., words or concepts) are connected by the ties (i.e., the co-occurrence 
of words or the relationships between concepts; Verd & Lozares, 2014). Network text analysis 
assumes that the semantic meaning of texts is revealed by the patterns of network structure, that is, how 
the units of texts are connected in the network (Diesner & Carley, 2005). For instance, in the 
network text analysis of stem-cell research literature, words such as “bone,” “marrow,” 
“transplantation,” “tissue,” and “peripheral” are tightly connected by their co-occurrence ties 
(Leydesdorff & Hellsten, 2005), denoting that the co-occurrence relationships among these 
words are stronger than their relationships with other words. In the network analysis of texts 
compiled from 1.9 billion anonymous queries on epilepsy, the nodes are the words that are connected 
by the co-occurrence ties; the tightly connected clusters of words suggest the topics, including the 
seizures and their effects on the body, anti-epileptic drugs and their side effects, and the life changes 
(e.g., driving and employment) related to epilepsy (Miller, Groves, Knopf, Otte, & Silverman, 2017). 
In addition to conceptualizing words as the units of texts in the networks, researchers also consider 
hashtags as the units of texts in analyzing the text data acquired from social media. For instance, in 
the network analysis of 660,051 tweets containing the hashtags #CommonCore and #CCSS (the 
Common Core State Standards), the nodes are the hashtags that are connected by the co-occurrence 
ties; the tightly connected clusters detected by the faction algorithm suggest the online discourse on 
the Common Core State Standards on Twitter centered around the topics, including the Republican 
Party’s 2016 presidential candidates (e.g., #Trump2016 and #TedCruz2016), anti-Common Core 
(e.g., #StopCommonCor and #StopCC), education policy and reform (e.g., #NCLB and #ESSA), 
as well as teaching and testing (e.g., #teaching and #testing) (Wang & Fikis, 2017). Another novel 
approach to conceptualizing texts as networks is to consider nodes as the nouns that are 
connected by the verbs as ties. For instance, in the network analysis of 130,213 news articles on the 
2012 U.S. presidential elections, the nodes are the nouns such as Obama, economy, and efforts, and 
the ties are the verbs such as celebrate, acclaim, and blame (Sudhahar, Veltri, & Cristianini, 2015). 
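
To make the idea of hashtags as nodes and co-occurrence as ties concrete, the following sketch counts how often each pair of hashtags appears in the same tweet. It is a simplified illustration under stated assumptions: the function name `cooccurrence_ties` and the regular expression used to find hashtags are choices made for this example, and it covers only the tie-counting step, not the cluster detection (such as the faction algorithm) used in the cited study.

```python
import re
from collections import Counter
from itertools import combinations

# Simplified hashtag pattern (an assumption): '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def cooccurrence_ties(tweets):
    """Count how often each unordered pair of hashtags shares a tweet.

    Nodes are lowercased hashtags; the returned Counter maps a sorted
    (hashtag, hashtag) pair to its tie weight.
    """
    ties = Counter()
    for text in tweets:
        # De-duplicate within a tweet and sort so each pair has one canonical order.
        tags = sorted({tag.lower() for tag in HASHTAG.findall(text)})
        for pair in combinations(tags, 2):
            ties[pair] += 1
    return ties
```

The resulting weighted pairs form an edge list that can be passed to a network analysis package for clustering and visualization.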

Both network text analysis and the aforementioned topic models are emerging 
methodological approaches to analyze text data. One might wonder: If we use the two methods to 
analyze the same text dataset, do their results differ? The answer, according to the literature 
(Leydesdorff & Nerghes, 2015; Wang & Fikis, 2016), is affirmative. In fact, the two methods can 
yield results that are significantly uncorrelated. This by no means suggests that topic models and 
network text analysis are invalid. Rather, it suggests the methods work well if applied for different 
purposes: topic modeling is more appropriate to reveal similarities, whereas network text analysis 
is more appropriate to reveal semantic meanings. The different results yielded by using topic models 
and network text analysis are analogous to the different results generated from using different 
conceptual frameworks and coding schemes when manually coding the same dataset in qualitative 
research. 

It is worth noting that when analyzing the hashtag co-occurrence networks, it is of critical 
importance to select the appropriate hashtags. The hashtags might keep changing and evolving over 
the process of education policymaking and implementation. Correspondingly, the appropriate 
hashtags should be broad enough to include all permutations of the words and phrases relevant to 
the topic of interest. However, if the selection is too broad, there is a risk of including irrelevant content, adding 
noise to the data. Therefore, an iterative process is recommended to examine the data retrieved using 
the selected hashtags. 

Sentiment Analysis 

The third methodological frontier that holds great promise in education policy research is 
sentiment analysis, also known as opinion mining. Sentiment analysis identifies the sentiment and 
emotions in text data through detecting emotion-bearing words (Liu, 2010). For instance, 
negative sentiment is considered to be expressed in the tweet “Fear, retaliation ruled @UserID HR 
department, employees say” because negative emotion-bearing words (i.e., fear and 
retaliation) were used; positive sentiment is considered to be expressed in the tweet “Thank you 
for surprising one of our amazing teachers today!” because positive emotion-bearing words 
(i.e., thank and amazing) were used. Sentiment analysis can be applied at various levels (e.g., 
document, sentence, and aspect) using sentiment lexicons such as SentiWordNet (Feldman, 2013). 
With the wealth of digital data, sentiment analysis has emerged as a supplement to the existing 
methods (e.g., surveys and interviews) to gauge public opinion, an aggregate of individual views, 
attitudes, and beliefs about a particular topic expressed by the digital public. 
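
The lexicon-based logic in the two tweet examples above can be sketched as follows. This is a deliberately minimal illustration, not the algorithm of any particular tool: the four-word lexicon is a hypothetical stand-in for a real resource such as SentiWordNet, and the function name `sentiment` and the simple count-based scoring rule are assumptions made for the example.

```python
import re

# Hypothetical mini-lexicon built from the two example tweets;
# real sentiment lexicons are far larger and assign graded scores.
POSITIVE = {"thank", "amazing"}
NEGATIVE = {"fear", "retaliation"}


def sentiment(text):
    """Label a text by counting positive and negative lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Because the rule only counts lexicon hits, any text without a listed word is labeled neutral regardless of what it means in context.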

Automated sentiment analysis has been applied to research in many fields to examine 
public opinion. In business, sentiment analysis has been used to evaluate the sentiment in financial 
news articles to predict stock prices (Schumaker, Zhang, Huang, & Chen, 2012). In political science, 
sentiment analysis was conducted to measure public opinion on the 2008 Obama-McCain debate 
(Fernandez-Gavilanes, Alvarez-Lopez, Juncal-Martinez, Costa-Montenegro, & Gonzalez-Castano, 
2016). In the field of public policy, sentiment analysis was conducted to gauge public opinion on 
U.S. immigration policy (Chung & Zeng, 2015) and crime problems (Jeremy, Paul, Krone, 
Spiranovic, & Cockburn, 2015). In the field of education policy, sentiment analysis was conducted to 
investigate public opinion on the Common Core, and it was found that negative sentiment 
surpassed positive sentiment in all 50 states and the District of Columbia (Wang & Fikis, 2017). 

If sentiment analysis is used to analyze social media data, researchers can often pinpoint the 
geographical locations of social media users by geographic identifiers, including geotagged locations 
and self-reported locations. Prior literature has consistently suggested that approximately 1% of 
tweets are geotagged, thus providing the data on latitude and longitude of the locations where the 
tweets are posted (e.g., Jahanbakhsh & Moon, 2014; Mislove, Lehmann, Ahn, Onnela, & 

Rosenquist, 2012; Ram et al., 2015; Young, Rivers, & Lewis, 2014). In addition, around 70% of 
Twitter users self-report their geographic locations on their Twitter profiles (e.g., Mislove et al., 

2012; Wang & Fikis, 2017). These data on locations, along with the data on the time stamps of social 
media posts, offer researchers opportunities to examine when and how public opinion emerges, 
fluctuates, and evolves at different geographical scales such as states, congressional and legislative 
districts, and large cities and towns. Moreover, when social media users post selfies, as more than 
half of millennials in the United States have already done (Taylor, 2014; Wilson, 2014), public 
opinion can be examined at an even more granular level by demographic groups such as African 
Americans, Hispanics, and Asian Americans. More intriguingly, public opinion is not a static entity. 
Rather, public opinion is contagious (Kramer et al., 2014), and it evolves throughout the 
policymaking and implementation process. As a corollary, sentiment analysis can be an instrumental 
tool in education policy research to glean insights into how education policy affects the public, and 
into the interplay among education policy, public opinion, and policy outcomes in diverse political, 
economic, and cultural contexts. 
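
To make the idea of tracking opinion across time and place concrete, the following is a minimal sketch that averages sentiment scores by state and calendar week. It assumes that each post has already been assigned a sentiment score and a state; the records, the scores, and the function name `mean_sentiment_by_state_week` are hypothetical and invented for illustration only.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (ISO timestamp, state, sentiment score in [-1, 1]).
# All values are invented for illustration.
tweets = [
    ("2015-03-02T14:05:00", "NY", -0.6),
    ("2015-03-03T09:30:00", "NY", -0.2),
    ("2015-03-04T18:45:00", "CA", 0.4),
    ("2015-03-10T11:00:00", "NY", 0.1),
    ("2015-03-11T08:15:00", "CA", -0.5),
]

def mean_sentiment_by_state_week(records):
    """Average sentiment per (state, ISO year, ISO week)."""
    buckets = defaultdict(list)
    for stamp, state, score in records:
        iso = datetime.fromisoformat(stamp).isocalendar()
        buckets[(state, iso[0], iso[1])].append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

summary = mean_sentiment_by_state_week(tweets)
for key in sorted(summary):
    print(key, round(summary[key], 2))
```

The same grouping could be applied to congressional districts or cities, provided that the location field is reliable at that scale.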

Like other methodological approaches introduced earlier, sentiment analysis is a crude tool 
to detect sentiment in large-volume text datasets. Beyond that, automated sentiment analysis 
has a few unique limitations. First, fully automated sentiment analysis does not detect sarcasm 
well. Sarcasm is a sophisticated form of expression in which “praise” carries negative sentiment 
(Altrabsheh, Cocea, & Fallahkhair, 2015). Take the tweet “Brilliant! The real agenda behind 
#CommonCore!” as an example. Sentiment analysis algorithms might deem “brilliant” a 
positive-emotion-bearing word and erroneously categorize the tweet as positive. The 
second limitation of sentiment analysis is the lack of contextual understanding. Without it, 
sentiment analysis algorithms might categorize the tweets “Testing is so green.” and “Opt out is so 
White.” as neutral, which might disagree with manual coding results (“Testing is so green” 
might be manually coded as “the testing industry wrings profits from public education”; 
“Opt out is so White” might be manually coded as “the students and their 
parents who advocate for opting out of standardized testing are predominantly White”). The third 
limitation is the exclusion of the content of the webpages that hyperlinks in tweets point to. 
Consider the tweet “Must read, http://t.co/eUJqNEGZSm Is this where we are headed? Lots to 
think about.” The text of the tweet itself suggests neutral sentiment, but the linked 
webpage carries negative sentiment towards the proposed 
changes in the Elementary and Secondary Education Act (ESEA) Reauthorization. Given these 
limitations, sentiment analysis often needs some portion of manual coding to ensure the veracity of 
data analysis. 
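
The first and third limitations can be illustrated with a deliberately naive word-counting classifier. This is a minimal sketch, not the method used in any of the studies cited above: the two tiny lexicons and the function names `lexicon_sentiment` and `needs_manual_coding` are hypothetical, and real sentiment lexicons are far larger.

```python
import re

# Tiny hypothetical lexicons; real sentiment lexicons are far larger.
POSITIVE = {"brilliant", "great", "love", "good"}
NEGATIVE = {"terrible", "awful", "hate", "bad"}

def lexicon_sentiment(text):
    """Label text by counting positive and negative lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def needs_manual_coding(text):
    """Flag tweets that contain a hyperlink or that the lexicon scores as neutral."""
    has_link = re.search(r"https?://\S+", text) is not None
    return has_link or lexicon_sentiment(text) == "neutral"

sarcastic = "Brilliant! The real agenda behind #CommonCore!"
linked = ("Must read, http://t.co/eUJqNEGZSm Is this where we are headed? "
          "Lots to think about.")

print(lexicon_sentiment(sarcastic))   # the sarcastic tweet is scored as positive
print(lexicon_sentiment(linked))      # the linked tweet is scored as neutral
print(needs_manual_coding(linked))    # the hyperlink routes it to a human coder
```

Note that the sarcastic tweet is not flagged by this simple rule, which is one reason to manually code a random subsample of all tweets rather than only the flagged ones.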


Debunking the Misconceptions 

Given the richness of big data and the emerging methodological tools at our disposal, big data, if 
used properly, could be a gold mine for education policy research. However, big data is not a 
panacea. The label of “big” renders big data prone to misunderstanding. Establishing the veracity 
of big data (the fourth feature of big data) depends not only on aptly applying 
the aforementioned methodological tools and many others, but also on exercising caution against the 
misconceptions of big data. The three misconceptions discussed below have been noted in the 
literature in public policy and computational social science. These misconceptions have the potential 
to misguide education policy researchers as they use big data. Here, to ensure the veracity of big data 
in education policy research, I debunk the misconceptions as a cautionary note for future inquiry. 

Bigger is Not Necessarily Better 

The first misconception is that bigger is better. Not true! Big data, indeed, has much value to 
offer; yet it may delude us into thinking bigger is better. In fact, bigger is not necessarily better. Big 
data, with its large volume, wide variety, and high velocity, inherently carries more noise, 
so more effort is required to find the true patterns in the data because “the signal-to-noise ratio may be 
waning” (Silver, 2015, p. 447). More importantly, big data and “small data” are not mutually 
exclusive. The techniques in traditional statistics are indispensable to examine the validity and 
reliability of big data analytics. 

Take sampling as an example. Some argue that sampling may outlive its usefulness as we can 
now access samples so large that N ~ all (i.e., samples are close to population; Mayer-Schonberger & 
Cukier, 2013). In the literature of the big data era, it is not uncommon to see studies that 
analyzed many millions of data points. For instance, 46 billion words posted by 63 million unique 
Twitter users were treated as social sensors of human happiness (Dodds, Harris, Kloumann, 
Bliss, & Danforth, 2011); a corpus that contained 509 million tweets posted by 2.4 million Twitter 
users from 84 countries was examined to detect people’s seasonal mood changes (Golder & Macy, 
2011); 9 million tweets posted by the Twitter followers of candidates for the U.S. House and Senate 
and governorship in the 2010 midterm elections were used to study the digital public’s political 
expression (Bode et al., 2015). However, the “N ~ all” view has been challenged by many skeptics 
who have cautioned against biased sampling, particularly in studies relying on specific social 
networking sites (boyd & Crawford, 2012; Hargittai, 2015; Sandvig, 2015). In other words, the 
statement “we examined how 10.1 million U.S. Facebook users interact” does not indicate a 
representative sample of the U.S. population, but rather the 4% of Facebook users who 
self-reported their political affiliation on their Facebook profile page (Sandvig, 2015). In fact, people do 
not randomly opt into the use of Facebook and social networking sites in general (Hargittai, 2015). 
Whether people use a specific social networking site is associated with a host of factors, including 
age, gender, ethnicity, education, and income. Specifically, younger people are more likely than older 
people to use three popular social media sites (Facebook, Twitter, and LinkedIn); those with more 
education and higher income are more likely to use these three sites than those of lower 
socioeconomic status; women are more likely to use Facebook but less likely to use LinkedIn; 
African Americans are more likely to be on LinkedIn and Twitter, but less likely to use Facebook, 
than Whites (Hargittai, 2015). As a corollary, the demographics of the online population might shift 
constantly, and do not mirror those of the offline world (Diaz, Gamon, Hofman, Kiciman, & 
Rothschild, 2016). 
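
The cost of self-selection can be shown with a small simulation. The sketch below is purely illustrative: the two age groups, the support rates, and the platform adoption rates are invented numbers, not estimates from Hargittai (2015) or any other study. It compares a very large sample of "platform users" with a random sample of only 1,000 people.

```python
import random

random.seed(0)

# Hypothetical population: half "young", half "old" (all numbers are invented).
SUPPORT = {"young": 0.7, "old": 0.3}       # chance of supporting a policy
ON_PLATFORM = {"young": 0.6, "old": 0.1}   # chance of using the platform

population = []
for i in range(200_000):
    group = "young" if i % 2 == 0 else "old"
    supports = random.random() < SUPPORT[group]
    on_platform = random.random() < ON_PLATFORM[group]
    population.append((supports, on_platform))

def share_supporting(people):
    """Proportion of people in the list who support the policy."""
    return sum(supports for supports, _ in people) / len(people)

true_share = share_supporting(population)
platform_users = [p for p in population if p[1]]   # large but self-selected
random_sample = random.sample(population, 1_000)   # small but random

print(f"population:     {true_share:.3f}")
print(f"platform users: {share_supporting(platform_users):.3f} (n={len(platform_users)})")
print(f"random sample:  {share_supporting(random_sample):.3f} (n={len(random_sample)})")
```

Under these assumptions the platform sample is dozens of times larger than the random sample, yet its estimate sits much further from the population value, because the group that favors the policy is also the group more likely to be on the platform.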

More importantly, in an article titled “How Big Data is Unfair”, Hardt (2014) noted that the 
algorithms are apt to “favor those who belong to the statistically dominant groups” (para. 9), and “a 
variable that is positively correlated with the target in the general population might be negatively 
correlated with the target in a minority group” (para. 12). Therefore, sheer sample size alone, 
however large, does not substantiate the claim that N ~ all. Researchers must 
be cautious when making generalized claims based on a biased sample, as it is problematic to 
generalize the patterns in one demographic group at one place at one time to all demographic 
groups at all places at all times. 
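
Hardt's (2014) second point, that a correlation can reverse sign in a minority group, can be reproduced with synthetic data. In this minimal sketch the group sizes, slopes, and noise level are assumptions chosen for illustration; the helper names `pearson` and `simulate` are likewise hypothetical.

```python
import random

random.seed(1)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate(n, intercept, slope):
    """Draw n points with y = intercept + slope * x + Gaussian noise."""
    xs = [random.random() for _ in range(n)]
    ys = [intercept + slope * x + random.gauss(0, 0.3) for x in xs]
    return xs, ys

# Assumed setup: 90% majority with a positive relationship,
# 10% minority with a negative relationship.
maj_x, maj_y = simulate(9_000, intercept=0.0, slope=1.0)
min_x, min_y = simulate(1_000, intercept=1.0, slope=-1.0)

pooled_r = pearson(maj_x + min_x, maj_y + min_y)
minority_r = pearson(min_x, min_y)

print(f"pooled correlation:   {pooled_r:+.2f}")
print(f"minority correlation: {minority_r:+.2f}")
```

A model fitted to the pooled data would learn the majority's positive relationship and apply it, wrongly, to the minority group.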

Theoretical Framing Is Important As Always 

The second potential misconception of using big data in education policy research is that 
theoretical framing may become obsolete. Some predicted that the scientific approach of 
hypothesizing, modeling, and testing would become obsolete, because patterns can be revealed by 
crunching data through algorithms without being guided by theories (Anderson, 2008). This view is 
unfounded, because the revealed “patterns” might not exist and the “enormous quantities of data 
can offer connections that radiate in all directions” (boyd & Crawford, 2012, p. 668). The so-called 
“patterns” may simply be “chance associations,” the discernible associations that would inevitably 
show up if we look at enough data (Leinweber, 2007). Without appropriate theories to frame and 
guide data mining, it is easy to fall into the trap of “data mining sins” (Leinweber, 2007, p. 15). An 
example is Google Flu Trends (GFT), which was touted as able to predict flu by matching 50 
million Google search terms to over 1,000 terms suggesting the propensity of flu, and to 
generate flu predictions two weeks before the U.S. Centers for Disease Control and 
Prevention did. However, the GFT failed to predict the non-seasonal flu in 2009, partly because the 
GFT “was part flu detector and part winter detector” (Lazer, Kennedy, King, & Vespignani, 2014, p. 
1203). This is why boyd and Crawford (2012) argued that data, big or small, lose value when they are 
taken out of context. In education policy research, the socially-constructed data should never 
override theory. Theory matters. Context matters. These principles apply to big data analytics as 
well. Technology truly helps satiate our voracious appetite for data; computer science provides the 
capacity to collect and analyze voluminous amounts of data. But still, in the big data era, theories 
and data are inextricably linked. Data alone tell us nothing. Data are merely a vehicle we can lean on 
to enrich our understanding of social phenomena. When we simplify human behaviors to numbers, 
we run the risk of losing the richness of human behaviors. It is the researchers’ job to interpret data: 
unpacking the meaning behind the data in a given context (boyd & Crawford, 2012). Therefore, the 
data interpretation, data contextualization, and sense-making process must be guided by theories, 
rather than algorithms. In education policy research, on the one hand, theories are analogous to 
beacons guiding data collection, analysis, and interpretation, so that researchers do not 
cherry-pick variables to blindly hunt for patterns; on the other hand, the results from a theory-guided 
analytical approach can help facilitate theory development and enrich our understanding of intricate 
education policy issues. 
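
The notion of “chance associations” is easy to demonstrate. The following minimal sketch, under assumed sample sizes, correlates one random outcome with 2,000 predictors that are pure noise and therefore unrelated to the outcome by construction; the helper `pearson` and all variable names are hypothetical.

```python
import random

random.seed(2)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

n_obs = 30  # assumed number of observations
outcome = [random.gauss(0, 1) for _ in range(n_obs)]

# 2,000 predictors of pure noise: none has any real relationship to the outcome.
predictors = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(2_000)]
abs_r = [abs(pearson(p, outcome)) for p in predictors]

best_of_20 = max(abs_r[:20])
best_of_2000 = max(abs_r)

print(f"strongest |r| among the first 20 predictors: {best_of_20:.2f}")
print(f"strongest |r| among all 2,000 predictors:    {best_of_2000:.2f}")
```

The more variables one searches without a theory to constrain the search, the stronger the best spurious “pattern” tends to look.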

Even Automation Needs Human Involvement 

Given the large-volume, wide-variety, and high-velocity features of big data, much of the big 
data analytics is automated by algorithms. The third potential misconception of using big data in 
education policy research is that automation does not need human involvement. This misconception 
runs the risk of overemphasizing algorithm-enabled automation at the expense of the veracity of big 
data. Sophisticated algorithms, along with the brute force of high-performance computing (also 
called supercomputing), do not mean that valid results are automatically produced after feeding data 
into the magic wand of algorithms. Human judgment is needed in every step of big data 
analytics in education policy research to establish the veracity of research. Data quality needs to 
be evaluated by humans, as the data fed into the algorithms can be error-prone and unrepresentative 
of the targeted population (Tufekci, 2014). The model parameters used in automated data 
analysis need to be set by humans, such as those noted earlier in the section on topic models. Data 
interpretation requires researchers to draw upon their prior knowledge in the field to make sense of 
the data. As a corollary, to tap the potential of big data in the field of education policy, the key is not 
to blindly pursue clever algorithms or troves of data, but to combine automation with human 
judgment, so that they are complementary to each other. 
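
One common way to pair automation with human judgment is to have researchers manually code a subsample and then measure how well the automated labels agree with them. The sketch below uses Cohen's kappa for that purpose; this statistic is my choice for illustration rather than a procedure prescribed above, and the eight labels are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label lists.

    Undefined (division by zero) when both lists use a single identical label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical sentiment labels for the same eight tweets.
manual = ["pos", "neg", "neg", "neu", "neg", "pos", "neu", "neg"]
automated = ["pos", "pos", "neg", "neu", "neg", "pos", "neu", "neu"]

kappa = cohens_kappa(manual, automated)
print(round(kappa, 2))  # 0.64
```

A low kappa on the manually coded subsample would signal that the automated labels should not be trusted for the full dataset without revising the lexicon, the model parameters, or the coding scheme.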

Challenges 

Amid the abundant potential of big data, pressing challenges abound in using big data in 
education policy research. Among them are developing interdisciplinary research capacity and 
addressing privacy concerns and ethical conundrums. Here I discuss these two challenges that 
may deter using big data in education policy research. More importantly, I encourage future 
researchers to offer viable strategies to unleash the potential of big data in education policy research. 

Developing Interdisciplinary Research Capacity 

The first challenge we face is the deficiency of education policy researchers who are well 
versed in the fields of both education policy and computer science. In its essence, using big data in 
education policy research entails interdisciplinary efforts, in which the knowledge in education policy 
provides vital theoretical underpinnings that frame the studies, and the expertise in computer science 
enables the algorithmic approaches to data acquisition, mining, and analysis to scale up analytical 
capacity. Unfortunately, education policy researchers are often inadequately trained in computer 
science. 

To surmount this challenge, I propose two viable solutions to build and maximize the 
research capacity in using big data in education policy research. The first solution is to develop 
interdisciplinary research teams. Stepping out of the silos of their individual disciplinary 
boundaries, researchers in different fields (such as education policy, computer science, and data 
science) can collaborate and marshal their intellectual capital on data acquisition and analysis, as well as 
on interpreting data in rich social contexts. Another solution to the deficiency of interdisciplinary 
research capacity is to build a pipeline of ambidextrous researchers who are deft in both education 
policy research and computer science. In fact, government agencies have already taken the initiative 
to train aspiring socially-minded computer scientists and computationally-minded social scientists. 
For instance, the National Science Foundation has been funding initiatives to train interdisciplinary 
big data researchers (NSF, 2012). Moreover, many universities have begun building interdisciplinary 
programs that bring together social science, computer science, statistics, and data visualization 
(Wallach, 2015). It is thus recommended that future education policy researchers participate in such 
interdisciplinary training programs so that they will be well equipped to apply computational social 
science to education policy research. 

Addressing Privacy Concerns and Ethical Conundrums 

In a data-rich environment, researchers have been wrestling with privacy concerns that 
derive from the inherent tension between data and privacy. The word “data” is the plural of the Latin 
word “datum,” which means “something given,” whereas privacy refers to “the ability to control the 
dissemination and use of one’s information” (Wright et al., 2003, p. 148). As a result, big data and 
privacy have inherently contradictory goals (O'Leary, 2015). With Internet-connected digital devices 
at their disposal, people might trade privacy for convenience. We disclose our whereabouts when we hail 
a ride with Uber or use Google Maps to get to our destination. We share a part of our life when we 
use Facebook to stay in touch with family and friends. Further complicating the privacy issue is 
that once we share something online, either a picture or a comment, that “something” then has a life 
of its own on the Internet. That is, we no longer have much control over how the data about us are 
disseminated. In this digitally hyper-connected world, everything about our online activities, as boyd 
(2010) aptly put it, is “public by default, private when necessary” (para. 1). Still, for us 
researchers, as computerized algorithms allow us to scale up data collection and analysis through an 
automated process, we are now presented with the mounting challenge of how to strike a balance 
between strengthening the oversight of data access and advancing scientific discoveries. 

Further, ethical conundrums loom large as we capitalize on big data in education policy 
research. First, ready access to data does not necessarily mean that the data are meant to be 
consumed by anyone (boyd, 2012). Rather, the easy access to big data highlights the need for greater 
oversight to deter prying eyes on personal data and prevent the potential abuse of massive datasets. 
A recent controversial study was the massive-scale experiment (N = 689,003) on emotional 
contagion on Facebook (Kramer et al., 2014). To test whether emotions were contagious through 
Facebook’s social networks without face-to-face communication or non-verbal cues, Kramer et al. 
manipulated the amount of emotional content in Facebook users’ News Feeds. Kramer et al. stated 
that the data collection and analysis process was “consistent with Facebook’s Data Use Policy, to 
which all users agree prior to creating an account on Facebook, constituting informed consent for 
this [Kramer et al.’s] research” (Kramer et al., 2014, p. 8789). This leads to a critical question: Are 
social media sites’ Terms of Service the de facto informed consent for social experiments in the 
digital age? To date, this question has not received an agreed-upon answer. Most Institutional 
Review Board protocols have not provided adequate guidelines on conducting large-scale social 
experiments on social media sites. This is particularly problematic when human participants are not 
fully informed and are not explicitly asked for informed consent to participate in online social 
experiments. 

Another ethical challenge centers on the use of social bots in the experiments. Social bots 
are the social media accounts that are operated by algorithms, instead of humans, to automate the 
content generation and interaction with other human users on social media (Ferrara, Varol, Davis, 
Menczer, & Flammini, 2016; Kollanyi, Howard, & Wooley, 2016). Recently, social bots have been 
increasingly used to contribute to and even steer the direction of online discourse on political 
elections and public policy issues (Bessi & Ferrara, 2016; Ferrara et al., 2016). A prime example is 
that automated pro-Trump bots were used more aggressively than pro-Clinton bots during the 
U.S. presidential debates in 2016; in the final presidential debate in particular, the pro-Trump bots 
produced seven times as much traffic on Twitter as the pro-Clinton bots (Kollanyi, Howard, & 
Wooley, 2016). Thus, social bots can potentially shape and even intentionally 
manipulate the contours of online discourse, wielding subtle and even significant influence on 
public opinion and policy issues. In the research realm, social bots have been used as “virtual 
confederates” in social experiments conducted online to study human social behaviors. They play 
the same role as human confederates, the people trained by researchers to follow pre-assigned 
scripts in social experiments. Human confederates were used in Solomon Asch’s (1956) conformity 
experiment and in Stanley Milgram’s experiment, in which they acted as pedestrians walking on the 
street and looking at a balcony so that researchers could examine how many other pedestrians 
followed the confederates’ behavior (Milgram, Bickman, & Berkowitz, 1969). As a 
result, the ethical challenges of informed consent, deception, and direct harm must be 
addressed as bots are used as “virtual confederates” in experiments to control 
variables and create an artificial context (Krafft, Macy, & Pentland, 2016). 

Concluding Remarks 

Big data research has been growing by leaps and bounds. In education policy, researchers 
can collect big data from an array of sources, including, but not limited to, the data-driven education 
accountability systems, the digital footprints left by the digital services and devices used by education 
stakeholders, and data acquisition technologies such as OCR, which converts undigitized text 
into digitized data, and APIs, which access data shared by technology companies. With the rich 
data, new models are rapidly being developed to analyze text, video, images, and audio, as well as 
mobile phone call logs and satellite imagery. The wealth of data and analytical tools allows researchers 
to examine education policy research questions that could not be easily explored in the past. The 
three methodological approaches (topic models, network text analysis, and sentiment analysis) 
introduced in this commentary hold great potential for the real-time or near-real-time analysis of big 
data in education policy. Such analysis can offer timely results to inform education 
policymaking, implementation, and evaluation. For education policy researchers to venture into big 
data, it is important to develop interdisciplinary research capacity, as well as address the privacy 
concerns and ethical conundrums. This paper did not take a systematic approach to 
comprehensively surveying all methodological tools used in analyzing big data. Rather, to ensure the 
empirical examples presented in this paper are applicable in education policy research, this 
commentary introduces three analytical approaches used in the closely related fields of public policy 
and computational social science in an effort to invite education policy researchers to venture into 
big data. As we embrace big data in education policy research, many hurdles remain and new 
obstacles will emerge. It is my hope that this commentary will invite forward-looking discussions 
and the exploration of a research agenda for using big data in education policy. 